Validating Hyperparameters





Kerry Back

Hyperparameters

  • The maximum depth of the trees in a random forest and the number of trees are examples of hyperparameters.
  • “Hyperparameter” means that these values are specified ex ante rather than estimated by fitting the model.
  • The hidden layer sizes in a neural net are also hyperparameters.

Overfitting

  • Hyperparameters control how complex the model is.
  • More complex models will better fit the training data.
  • But we risk overfitting.
    • Overfitting means fitting our model to random peculiarities of the training data.
    • An overfit model will not work well on new data.
  • So more complexity is not necessarily better.
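
The risk above is easy to see with a small sketch. The data and model below are hypothetical, for illustration only: an unrestricted decision tree memorizes the training data perfectly but scores noticeably worse on data it has not seen.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# hypothetical toy data: a linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = X[:, 0] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# a tree with no depth limit memorizes the training data
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(deep.score(X_train, y_train))  # perfect R-squared on training data
print(deep.score(X_test, y_test))    # noticeably lower on new data
```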

Validation

  • Reserve some data called validation data.
  • Train with different hyperparameters on training data that does not include validation data.
  • Choose hyperparameters that perform best on validation data.
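
A minimal sketch of this procedure, using hypothetical data and `train_test_split` to reserve the validation set (the candidate depths 2, 4, and 8 are assumptions, not from the slides):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# hypothetical data: two features, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=200)

# reserve 25% of the data (the default split) as validation data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# fit one model per candidate depth on the training data only,
# then keep the depth that scores best on the validation data
scores = {
    d: RandomForestRegressor(max_depth=d, random_state=0)
    .fit(X_tr, y_tr)
    .score(X_val, y_val)
    for d in (2, 4, 8)
}
best_depth = max(scores, key=scores.get)
```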

Cross validation

  • Split data into, for example, 3 sets of equal size, say A, B, and C.
  • Train on A \(\cup\) B, assess performance on C.
  • Train on A \(\cup\) C, assess performance on B.
  • Train on B \(\cup\) C, assess performance on A.
  • Choose hyperparameters with best average performance on A, B, and C.
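
The three train/assess rounds above can be run in one call with `cross_val_score` (a sketch on hypothetical data; `cv=3` gives the three folds A, B, and C):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# hypothetical data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=90)

# cv=3 splits the data into thirds; each third is the validation
# set once while the model is trained on the other two
scores = cross_val_score(
    RandomForestRegressor(max_depth=4, random_state=0), X, y, cv=3
)
print(scores.mean())  # average performance across the three folds
```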

Grid Search CV

from sklearn.model_selection import GridSearchCV
  • Pass a model or pipeline to GridSearchCV without specifying the hyperparameters.
  • Pass a set (“grid”) of hyperparameters to evaluate.
  • Fit the GridSearchCV.

Everything in one step

Fitting GridSearchCV does all of the following:

  • Split the data into subsets A, B, and C as above (the default is 5 subsets rather than 3).
  • Fit the model or pipeline on training sets and evaluate on validation sets.
  • Choose hyperparameters with best average performance.
  • Refit the model on the entire dataset using the best hyperparameters.
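
A self-contained sketch of those steps on hypothetical data (the depth grid here is an assumption for illustration): after `fit`, the winning hyperparameters and the refit model are available as attributes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# hypothetical data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=100)

cv = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [2, 4]},
)
cv.fit(X, y)  # 5-fold CV for each depth, then refit on all of the data

print(cv.best_params_)     # hyperparameters with best average score
print(cv.best_estimator_)  # the refit model; cv.predict(...) uses it
```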

Random forest example

  • roeq, mom12m, and rnk for 2021-01 as before
  • Define model without specifying max depth.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
  random_state=0
)
  • Define pipeline as before
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
  transform,
  poly,
  transform,
  model
)

Define parameters to evaluate

  • Example: evaluate depths of 4, 6, and 8.
  • Prefix each parameter name with the pipeline step it belongs to.
    • The step name is the lowercased class name.
    • Join the step name and the parameter name with a double underscore.
param_grid = {
    "randomforestregressor__max_depth": 
    [4, 6, 8]
}
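
The valid grid keys can be listed from the pipeline itself. Since `transform` and `poly` are defined in earlier lectures, the sketch below assumes stand-ins (`StandardScaler`, `PolynomialFeatures`); the key for the forest's depth is the same either way.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# assumed stand-ins for the transform and poly steps from earlier lectures
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2),
    RandomForestRegressor(random_state=0),
)

# make_pipeline names each step after its lowercased class, so the
# grid keys can be read off of the pipeline's parameters:
print([k for k in pipe.get_params() if k.endswith("max_depth")])
```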

Fit and save

from joblib import dump

cv = GridSearchCV(
  pipe, 
  param_grid=param_grid
)

X = data[["roeq", "mom12m"]]
y = data["rnk"]

cv.fit(X, y)
dump(cv, "forest2.joblib")


Later:

from joblib import load

forest = load("forest2.joblib")
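
A round-trip sketch of the save-and-load step, on hypothetical data and writing to a temporary directory rather than the working directory used in the slides: the loaded object is the fitted `GridSearchCV`, so its best hyperparameters and predictions carry over.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# hypothetical data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=60)

cv = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid={"max_depth": [2, 4]}
).fit(X, y)

# temp directory here; the slides save to forest2.joblib directly
path = os.path.join(tempfile.mkdtemp(), "forest2.joblib")
dump(cv, path)

forest = load(path)  # the fitted GridSearchCV survives the round trip
print(forest.best_params_)
```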